Use GET with comma-separated values for multi-value waterdata queries#233

Merged
thodson-usgs merged 1 commit into DOI-USGS:main from thodson-usgs:fix-waterdata-get-multivalue
May 15, 2026
Collaborator

@thodson-usgs thodson-usgs commented Apr 14, 2026

Route most multi-value waterdata kwargs through GET with comma-joined values instead of POST+CQL2. The OGC API supports ?parameter_code=00060,00010 for all collections except monitoring-locations, which still rejects comma-separated values and so retains the POST+CQL2 path.

Service-scoped routing

A new _CQL2_REQUIRED_SERVICES = frozenset({"monitoring-locations"}) constant picks the path. When the Water Data APIs team enables comma-separated support on monitoring-locations, emptying this set drops the POST branch in one edit.
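The routing rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in `dataretrieval/waterdata/utils.py`: the `route_request` helper and its return shape are hypothetical, and the CQL2 body follows the generic CQL2 JSON `in`/`and` encoding, which may differ from the envelope the library actually sends.

```python
from __future__ import annotations

from typing import Any

# Mirrors the constant named in the PR description.
_CQL2_REQUIRED_SERVICES = frozenset({"monitoring-locations"})


def route_request(service: str, params: dict[str, Any]):
    """Hypothetical helper: return (method, query_params, json_body)."""
    lists = {k: v for k, v in params.items() if isinstance(v, (list, tuple))}
    scalars = {k: v for k, v in params.items() if k not in lists}

    if lists and service in _CQL2_REQUIRED_SERVICES:
        # Assumed CQL2 JSON shape: one "in" clause per list-valued
        # parameter, combined with "and".
        body = {
            "op": "and",
            "args": [
                {"op": "in", "args": [{"property": k}, list(v)]}
                for k, v in lists.items()
            ],
        }
        return "POST", scalars, body

    # Every other service: comma-join list values into one GET parameter.
    query = dict(scalars)
    for k, v in lists.items():
        query[k] = ",".join(str(item) for item in v)
    return "GET", query, None
```

Under this sketch, `route_request("daily", {"parameter_code": ["00060", "00010"]})` yields a GET with `parameter_code=00060,00010`, while the same list sent to `monitoring-locations` yields a POST with a CQL2 body. Emptying the frozenset sends every service down the GET branch.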

Date formatting moved before routing

_format_api_dates now runs once at the top of _construct_api_requests, so both routing paths consume an already-formatted ISO8601 string rather than the original list shape.
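To illustrate why formatting first matters, here is a rough sketch of turning a two-element date list into a single ISO 8601 interval string before any routing decision. The `format_time_param` function is a hypothetical stand-in for `_format_api_dates`; the UTC assumption for naive inputs and the exact output format are assumptions of this sketch, not confirmed behavior of the library.

```python
from datetime import datetime, timezone


def format_time_param(value):
    """Hypothetical stand-in: scalar -> instant, 2-element list -> interval."""

    def one(text):
        dt = datetime.fromisoformat(text)  # accepts 'YYYY-MM-DD' or full timestamps
        if dt.tzinfo is None:
            # Assumption for this sketch: naive inputs are treated as UTC.
            dt = dt.replace(tzinfo=timezone.utc)
        return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

    if isinstance(value, (list, tuple)):
        if len(value) != 2:
            raise ValueError("expected [start, end]")
        # ISO 8601 intervals separate start and end with a slash.
        return "/".join(one(v) for v in value)
    return one(value)
```

Because the result is already a plain string, a later step that comma-joins list values never sees the date list and cannot mistake an interval for a multi-value parameter.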

Closes #210

Contributor

Copilot AI left a comment

Pull request overview

This PR updates the Water Data OGC request builder to use comma-separated GET parameters for multi-value queries where supported, reducing reliance on POST+CQL2 while preserving POST behavior for monitoring-locations.

Changes:

  • Adds _CQL2_REQUIRED_SERVICES to keep monitoring-locations on POST+CQL2.
  • Formats date/time parameters before choosing GET vs POST.
  • Adds unit tests for multi-value GET, monitoring-location POST fallback, scalar handling, numeric list joining, and date interval formatting.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| dataretrieval/waterdata/utils.py | Updates request construction logic to comma-join list parameters for GET except for CQL2-required services. |
| tests/waterdata_test.py | Adds tests covering the updated request construction behavior. |


@thodson-usgs thodson-usgs marked this pull request as ready for review May 15, 2026 19:34
The OGC Water Data API supports comma-separated multi-value parameters
for most collections, so list-valued kwargs (e.g. parameter_code=
["00060", "00010"]) can now be sent as a single GET request instead of
a POST+CQL2 body. The monitoring-locations collection still rejects
comma-separated values and so retains the POST+CQL2 path; a new
_CQL2_REQUIRED_SERVICES frozenset is the single removal point for
when the upstream API closes that gap.

Date/time parameters are now formatted to ISO8601 once at the top of
_construct_api_requests ahead of the routing decision, so both paths
see a post-formatted string instead of the original list shape.

Tests cover the GET multi-value path, the POST+CQL2 path (including
the full CQL2 envelope shape), single-value scalars, numeric lists,
and the two-element date-list interval case.

Closes DOI-USGS#210
@thodson-usgs thodson-usgs force-pushed the fix-waterdata-get-multivalue branch from a432be8 to 40ed9eb Compare May 15, 2026 19:36
@thodson-usgs thodson-usgs merged commit 36866a0 into DOI-USGS:main May 15, 2026
8 checks passed
thodson-usgs added a commit to thodson-usgs/dataretrieval-python that referenced this pull request May 17, 2026
For multi-value waterdata queries (e.g. monitoring_location_id with
~300+ sites), the GET URL produced by PR DOI-USGS#233 blows past the server's
~8 KB nginx buffer and the API returns HTTP 414. This PR adds a
chunker that transparently splits long list params across sub-requests
so each URL fits the byte budget.

The chunker is a decorator applied to ``_fetch_once`` outside the
existing ``@filters.chunked`` (CQL chunker), so list-chunking is the
outer loop and filter-chunking is the inner loop:

  @chunking.multi_value_chunked(build_request=_construct_api_requests)
  @filters.chunked(build_request=_construct_api_requests)
  def _fetch_once(args): ...
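The loop ordering that this stacking implies can be shown with a toy example. The `split_list` decorator, the `sites` and `clauses` keys, and the `fetch_once` function below are all hypothetical; they only demonstrate that the outermost decorator drives the outer loop, and do not reproduce either real chunker.

```python
import functools


def split_list(key, size):
    """Toy chunker: call the wrapped function once per slice of args[key]."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(args):
            values = args[key]
            results = []
            for start in range(0, len(values), size):
                chunk_args = {**args, key: values[start:start + size]}
                results.extend(fn(chunk_args))
            return results

        return wrapper

    return decorator


calls = []


@split_list("sites", 2)    # outer loop, standing in for multi_value_chunked
@split_list("clauses", 1)  # inner loop, standing in for filters.chunked
def fetch_once(args):
    calls.append((tuple(args["sites"]), tuple(args["clauses"])))
    return [len(calls)]


result = fetch_once({"sites": ["a", "b", "c"], "clauses": ["x", "y"]})
```

After this runs, `calls` records every clause chunk for the first site chunk before moving to the second site chunk, which is the outer-list, inner-filter ordering described above.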

Key design points:

- ``_plan_chunks`` greedy-halves the largest chunk across all
  dimensions until the worst-case sub-request fits ``url_limit``
  (URL + body, via ``_request_bytes``, so POST routes are sized
  correctly). Cartesian product of per-dim partitions becomes the
  sub-request set; capped at ``max_chunks=1000``.

- ``_filter_aware_probe_args`` coordinates with ``filters.chunked``:
  the planner probes URL length using a synthetic clause that matches
  the inner filter chunker's bail-floor size (longest single clause,
  scaled by worst-case URL encoding ratio). Without this coordination,
  the outer planner would raise ``RequestTooLarge`` on combinations
  the stacked chunkers can actually handle.

- ``QuotaExhausted`` mid-call guard reads ``x-ratelimit-remaining``
  after each sub-request; if it drops below ``quota_safety_floor=50``,
  the wrapper raises with the partial frame, completed-chunk offset,
  and last observed remaining quota — letting callers salvage or
  resume after the rate-limit window resets, rather than crash into a
  silent mid-pagination 429.

- ``RequestTooLarge`` is raised when the smallest reducible plan
  still exceeds ``url_limit`` (every multi-value param at a singleton
  chunk and any chunkable filter at the inner chunker's bail floor)
  or when the cartesian product exceeds ``max_chunks``.

- All defaults (``url_limit``, ``max_chunks``, ``quota_safety_floor``)
  resolve at call time, so monkey-patching ``filters._WATERDATA_URL_
  BYTE_LIMIT`` for tests / non-default quotas affects the decorator
  uniformly.
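The greedy-halving plan from the first design point can be sketched as below. This is a simplified illustration under stated assumptions: `plan_chunks`, its signature, and the caller-supplied `request_bytes` callback are hypothetical, the local `RequestTooLarge` class is a stand-in for the real exception, and filter-chunker coordination is left out entirely.

```python
from itertools import product
from math import prod


class RequestTooLarge(Exception):
    """Local stand-in for the exception named in the commit message."""


def _joined_len(chunk):
    return len(",".join(chunk))


def plan_chunks(dims, request_bytes, url_limit, max_chunks=1000):
    """dims maps parameter name -> list of string values."""
    parts = {name: [list(values)] for name, values in dims.items()}
    while True:
        # Worst-case sub-request: the longest chunk of every dimension at once.
        worst = {n: max(chunks, key=_joined_len) for n, chunks in parts.items()}
        if request_bytes(worst) <= url_limit:
            break
        splittable = [(n, c) for n, c in worst.items() if len(c) > 1]
        if not splittable:
            raise RequestTooLarge("singleton chunks still exceed url_limit")
        # Halve the largest chunk across all dimensions.
        name, chunk = max(splittable, key=lambda item: _joined_len(item[1]))
        idx = parts[name].index(chunk)
        mid = len(chunk) // 2
        parts[name][idx:idx + 1] = [chunk[:mid], chunk[mid:]]
    if prod(len(chunks) for chunks in parts.values()) > max_chunks:
        raise RequestTooLarge("plan exceeds max_chunks")
    names = list(parts)
    # Cartesian product of per-dimension partitions is the sub-request set.
    return [dict(zip(names, combo)) for combo in product(*(parts[n] for n in names))]
```

A caller would pass a `request_bytes` function that builds the real request and measures URL plus body; here any monotone size estimate works for experimentation.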

Public additions:

- ``dataretrieval.waterdata.chunking.multi_value_chunked``
- ``dataretrieval.waterdata.chunking.RequestTooLarge``
- ``dataretrieval.waterdata.chunking.QuotaExhausted`` (carries
  ``partial_frame``, ``partial_response``, ``completed_chunks``,
  ``total_chunks``, ``remaining``)
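As a rough sketch of how a caller might see the quota guard behave, the loop below checks `x-ratelimit-remaining` after each sub-request and raises with the partial result. Everything here is illustrative: `run_chunks` and the `fetch` callback are hypothetical, the local `QuotaExhausted` carries only a subset of the attributes listed above, and a plain list stands in for the partial frame.

```python
class QuotaExhausted(Exception):
    """Local stand-in carrying a subset of the attributes listed above."""

    def __init__(self, partial_frame, completed_chunks, total_chunks, remaining):
        super().__init__(
            f"quota low ({remaining}) after {completed_chunks}/{total_chunks} chunks"
        )
        self.partial_frame = partial_frame
        self.completed_chunks = completed_chunks
        self.total_chunks = total_chunks
        self.remaining = remaining


def run_chunks(chunks, fetch, quota_safety_floor=50):
    """fetch(chunk) -> (rows, headers); headers is a plain dict in this sketch."""
    rows = []
    total = len(chunks)
    for done, chunk in enumerate(chunks, start=1):
        data, headers = fetch(chunk)
        rows.extend(data)
        remaining = headers.get("x-ratelimit-remaining")
        if remaining is None or done == total:
            continue  # nothing to check, or no further requests to protect
        if int(remaining) < quota_safety_floor:
            raise QuotaExhausted(rows, done, total, int(remaining))
    return rows
```

A caller can catch the exception, keep `partial_frame`, and resume from `completed_chunks` once the rate-limit window resets. A floor of 0 never triggers in this sketch, since the remaining count cannot drop below zero.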

Tests (30 new):

- ``_filter_aware_probe_args`` worst-case-clause modelling
- ``_plan_chunks`` greedy halving, RequestTooLarge floor, filter-
  chunker coordination, ``max_chunks`` cap, lazy-default reads
- ``multi_value_chunked`` pass-through, cartesian-product shape,
  end-to-end with stacked filter chunker
- ``QuotaExhausted`` header parsing, mid-call abort, last-chunk no-
  abort, zero-floor disable
- ``RequestTooLarge`` message contents and triggering conditions

End-to-end correctness verified against the live API: identical
per-site cell-for-cell output between unchunked (single call) and
chunked (forced fan-out via patched limit) paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Development

Successfully merging this pull request may close these issues.

Expand usage of GET calls in waterdata module

2 participants